Owner: Daniel Soukup - Created: 2025.11.01
In this notebook, we load the processed data and fit our models.
Let's load our processed data and create feature/target dataframes for both train and test.
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
# Read recipe inputs
processed_learn = dataiku.Dataset("processed_learn")
processed_learn_df = processed_learn.get_dataframe()
processed_test = dataiku.Dataset("processed_test")
processed_test_df = processed_test.get_dataframe()
processed_learn_df.shape, processed_test_df.shape
((152807, 44), (78826, 44))
processed_learn_df.head()
| | class of worker_Not in universe | class of worker_Private | class of worker_Self-employed-not incorporated | class of worker_infrequent_sklearn | sex_Male | education_Children | education_High school graduate | education_Some college but no degree | education_infrequent_sklearn | marital stat_Married-civilian spouse present | marital stat_Never married | marital stat_Widowed | marital stat_infrequent_sklearn | full or part time employment stat_Full-time schedules | full or part time employment stat_Not in labor force | full or part time employment stat_infrequent_sklearn | detailed household and family stat_Child <18 never marr not in subfamily | detailed household and family stat_Householder | detailed household and family stat_Nonfamily householder | detailed household and family stat_Spouse of householder | detailed household and family stat_infrequent_sklearn | detailed household summary in household_Child under 18 never married | detailed household summary in household_Householder | detailed household summary in household_Other relative of householder | detailed household summary in household_Spouse of householder | detailed household summary in household_infrequent_sklearn | num persons worked for employer_1 | num persons worked for employer_2 | num persons worked for employer_3 | num persons worked for employer_4 | num persons worked for employer_6 | num persons worked for employer_infrequent_sklearn | family members under 18_Not in universe | family members under 18_infrequent_sklearn | tax filer stat_Nonfiler | tax filer stat_Single | tax filer stat_infrequent_sklearn | income | age | wage per hour | capital gains | capital losses | dividends from stocks | weeks worked in year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 73 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 58 | 0 | 0 | 0 | 0 | 52 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 18 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 9 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
Some special characters in the one-hot column names (`<` and `>`) are not accepted by XGBoost as feature names and can cause issues in training, so we replace them here.
processed_learn_df.columns = [col.replace("<", "less").replace(">", "more") for col in processed_learn_df.columns]
processed_test_df.columns = [col.replace("<", "less").replace(">", "more") for col in processed_test_df.columns]
TARGET = 'income'
X_train, y_train = processed_learn_df.drop(columns=TARGET), processed_learn_df[TARGET]
X_test, y_test = processed_test_df.drop(columns=TARGET), processed_test_df[TARGET]
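A sanity check worth keeping after the renaming and split: train and test must expose identical feature columns in identical order, otherwise the model would silently misalign features at prediction time. A minimal sketch on toy stand-in frames (`X_tr`/`X_te` are illustrative; the real check would run on `X_train`/`X_test`):

```python
import pandas as pd

# toy stand-ins for X_train / X_test
X_tr = pd.DataFrame({"age": [73, 58], "weeks worked in year": [0, 52]})
X_te = pd.DataFrame({"age": [18], "weeks worked in year": [0]})

# identical columns, identical order
assert list(X_tr.columns) == list(X_te.columns)
print("feature columns aligned")
```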
Recall that roughly 8% of the processed samples fall into target class 1 (high income), so a dummy classifier that always predicts 0 would already be about 92% accurate.
y_train.mean()
0.08062457871694359
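As a quick check on that baseline claim, the majority-class accuracy follows directly from the positive rate; a minimal sketch on synthetic labels with the same ~8% positive rate (illustrative numbers, not the real `y_train`):

```python
import numpy as np

# synthetic labels with roughly 8% positives, mimicking the real target
rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.08).astype(int)

# a dummy classifier that always predicts 0 is correct on every negative
dummy_accuracy = 1 - y.mean()
print(round(dummy_accuracy, 2))
```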
Important note: we won't use the test set for any optimization, to avoid overfitting to it. We reserve the test set for a single final evaluation of the optimized model, as an unbiased estimate of performance on completely unseen data.
Our current approach focuses on optimizing an XGBoost binary classifier, using Optuna to search the hyperparameter space efficiently. We also address the class imbalance during training by weighting the minority class higher via sample weights:
from sklearn.utils.class_weight import compute_sample_weight
from typing import Union
def get_sample_weights(multiplier: Union[int, None]) -> Union[np.ndarray, None]:
    """
    Weight the minority class higher so it contributes more to the training loss.
    Return None when no multiplier is given, so XGBoost falls back to uniform weights.
    """
    if multiplier:
        return compute_sample_weight({0: 1, 1: multiplier}, y_train)
    return None
weights = get_sample_weights(10)
weights
array([1., 1., 1., ..., 1., 1., 1.])
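To make the weight mapping concrete, a minimal sketch of `compute_sample_weight` with an explicit class-weight dict on toy labels (`y_toy` is illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# toy labels: three negatives, one positive
y_toy = np.array([0, 0, 1, 0])

# class 0 keeps weight 1, class 1 is up-weighted by the multiplier
w = compute_sample_weight({0: 1, 1: 10}, y_toy)
print(w)  # -> [ 1.  1. 10.  1.]
```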
Next, we set up our cross-validation helper:
from typing import Dict, Any
import xgboost as xgb
from xgboost import XGBClassifier
def cross_val_score_xgb(param: Dict[str, Any]) -> pd.DataFrame:
    """
    Fit the model with 3-fold cross-validation using the provided params.
    Return the per-round train/test metrics as a DataFrame; the last row
    holds the final out-of-fold score for the metric set in the params.
    """
    dtrain = xgb.DMatrix(
        X_train,
        label=y_train,
        weight=get_sample_weights(param.get('multiplier')),
    )
    results = xgb.cv(
        params=param,
        dtrain=dtrain,
        num_boost_round=param.get("n_estimators"),  # xgb.cv default: 10
        nfold=3,
        seed=42,
        verbose_eval=False,
        stratified=param.get("stratified_cv"),  # default False
    )
    return results
# we won't change these
BASE_PARAMS = {
    "verbosity": 0,
    "objective": "binary:logistic",
    "eval_metric": "aucpr",  # adjusted for the imbalance
    "stratified_cv": True,
}
param = BASE_PARAMS.copy()
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "multiplier": 1,
})
Let's test our function:
results = cross_val_score_xgb(param)
results.tail(3)
| | train-aucpr-mean | train-aucpr-std | test-aucpr-mean | test-aucpr-std |
|---|---|---|---|---|
| 7 | 0.520371 | 0.007912 | 0.518399 | 0.005931 |
| 8 | 0.529616 | 0.007390 | 0.526744 | 0.002086 |
| 9 | 0.531546 | 0.007320 | 0.528351 | 0.001600 |
Let's try with a larger multiplier:
param.update({
    "n_estimators": 10,
    "max_depth": 2,
    "multiplier": 10,
})
results = cross_val_score_xgb(param)
results.tail(3)
| | train-aucpr-mean | train-aucpr-std | test-aucpr-mean | test-aucpr-std |
|---|---|---|---|---|
| 7 | 0.868235 | 0.000756 | 0.866994 | 0.001782 |
| 8 | 0.870861 | 0.000259 | 0.869497 | 0.002276 |
| 9 | 0.877161 | 0.001560 | 0.876323 | 0.002977 |
We can see that the multiplier has a massive effect on the aucpr score.
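One caveat worth noting: the sample weights are attached to the `DMatrix`, so they also enter the out-of-fold AUCPR computation itself. A toy sketch with `average_precision_score` (synthetic labels and scores, illustrative only) shows that up-weighting positives raises the score even when the predictions are unchanged:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = (rng.random(5_000) < 0.08).astype(int)
scores = y * 0.5 + rng.random(5_000)  # noisy but informative scores

unweighted = average_precision_score(y, scores)
weighted = average_precision_score(y, scores, sample_weight=np.where(y == 1, 10, 1))
print(round(unweighted, 3), round(weighted, 3))
```

Up-weighting positives raises precision at every threshold, so the weighted score is systematically higher; CV scores obtained with different multipliers are therefore not on a common scale, which is worth keeping in mind when interpreting the tuned multiplier below.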
Next, we'll look to optimize the model hyperparameters. As we do this, our experiments will be tracked using MLflow.
import dataiku
project = dataiku.api_client().get_default_project()
managed_folder = project.get_managed_folder('lV6oqreY')
The function below defines the hyperparameter space to explore (parameters and their ranges), focusing on five parameters with a known strong effect on model performance and regularization:
def objective(trial):
    """
    Capture a single param combination and model fitting,
    evaluated using cross-validation.
    """
    param = BASE_PARAMS.copy()
    param.update({
        "subsample": trial.suggest_float("subsample", 0.2, 1.0),  # default 1 - all rows
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),  # default 1 - all columns
        "n_estimators": trial.suggest_int("n_estimators", 10, 100, step=10),  # default 100
        "max_depth": trial.suggest_int("max_depth", 2, 20, step=2),  # default 6
        "multiplier": trial.suggest_int("multiplier", 1, 50),
    })
    with mlflow_handle.start_run(run_name="trial", nested=True):
        result = cross_val_score_xgb(param)
        best_score = result["test-aucpr-mean"].values[-1]
        # logging
        mlflow_handle.log_params(param)
        mlflow_handle.log_metrics({'best_score': best_score})
    return best_score
Finally, we are ready to run our study, currently consisting of 40 trials:
import optuna

N_TRIALS = 40

with project.setup_mlflow(managed_folder=managed_folder) as mlflow_handle:
    mlflow_handle.set_experiment("xgboost_hp_tuning")
    with mlflow_handle.start_run(run_name="study", nested=True) as study_run:
        study = optuna.create_study(direction="maximize")
        study.optimize(objective, n_trials=N_TRIALS, timeout=600)
        # logging
        best_params = study.best_trial.params
        mlflow_handle.log_params(best_params)
        mlflow_handle.log_metrics({'best_score': study.best_trial.value})
        # refit best model; drop the custom 'multiplier' key (it is not an
        # XGBoost parameter) and apply it via sample weights instead
        print("Fitting best model...")
        refit_params = {k: v for k, v in best_params.items() if k != "multiplier"}
        model = XGBClassifier(**refit_params)
        model = model.fit(X_train, y_train, sample_weight=get_sample_weights(best_params.get("multiplier")))
ERROR:dataikuapi.dss.project:MLflow >= 3 (detected: 3.5.1) is unsupported and may cause compatibility issues
[I 2025-11-03 20:33:41,607] A new study created in memory with name: no-name-b66d0f75-de1f-42b5-be19-f1cbedffdab2
[I 2025-11-03 20:33:46,880] Trial 0 finished with value: 0.9353716859134229 and parameters: {'subsample': 0.6899896335213391, 'colsample_bytree': 0.47441926305090054, 'n_estimators': 50, 'max_depth': 12, 'multiplier': 18}. Best is trial 0 with value: 0.9353716859134229.
🏃 View run trial at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/trial_WjZ 🧪 View experiment at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning
[I 2025-11-03 20:33:53,332] Trial 1 finished with value: 0.7688795175149528 and parameters: {'subsample': 0.9894126260211467, 'colsample_bytree': 0.9633812882041866, 'n_estimators': 70, 'max_depth': 12, 'multiplier': 3}. Best is trial 0 with value: 0.9353716859134229.
[I 2025-11-03 20:34:01,511] Trial 2 finished with value: 0.9611255382988323 and parameters: {'subsample': 0.4997494510293741, 'colsample_bytree': 0.36387740633944604, 'n_estimators': 70, 'max_depth': 18, 'multiplier': 39}. Best is trial 2 with value: 0.9611255382988323.
[I 2025-11-03 20:34:04,241] Trial 3 finished with value: 0.974544127423603 and parameters: {'subsample': 0.7469359797305992, 'colsample_bytree': 0.4354031041512033, 'n_estimators': 30, 'max_depth': 8, 'multiplier': 45}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:05,937] Trial 4 finished with value: 0.9677910820395811 and parameters: {'subsample': 0.9877253726617672, 'colsample_bytree': 0.6923529306571498, 'n_estimators': 10, 'max_depth': 12, 'multiplier': 40}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:16,324] Trial 5 finished with value: 0.971792378679984 and parameters: {'subsample': 0.8398611344898694, 'colsample_bytree': 0.4718609368050369, 'n_estimators': 100, 'max_depth': 12, 'multiplier': 50}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:24,493] Trial 6 finished with value: 0.8751990527666367 and parameters: {'subsample': 0.5119119066248882, 'colsample_bytree': 0.7741980763674625, 'n_estimators': 60, 'max_depth': 20, 'multiplier': 10}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:27,799] Trial 7 finished with value: 0.9718267503547336 and parameters: {'subsample': 0.9615664077715051, 'colsample_bytree': 0.5412624854415858, 'n_estimators': 30, 'max_depth': 12, 'multiplier': 44}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:37,551] Trial 8 finished with value: 0.911875484316195 and parameters: {'subsample': 0.6062526781694673, 'colsample_bytree': 0.7597052614029767, 'n_estimators': 80, 'max_depth': 16, 'multiplier': 15}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:43,415] Trial 9 finished with value: 0.9230100973482159 and parameters: {'subsample': 0.7521154962937127, 'colsample_bytree': 0.5731196558633411, 'n_estimators': 100, 'max_depth': 2, 'multiplier': 13}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:45,785] Trial 10 finished with value: 0.9613541271353071 and parameters: {'subsample': 0.32700115683853564, 'colsample_bytree': 0.20837589023727326, 'n_estimators': 30, 'max_depth': 4, 'multiplier': 31}. Best is trial 3 with value: 0.974544127423603.
[I 2025-11-03 20:34:48,656] Trial 11 finished with value: 0.9775087544919833 and parameters: {'subsample': 0.8446487571076141, 'colsample_bytree': 0.3223747630122027, 'n_estimators': 30, 'max_depth': 6, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:34:51,289] Trial 12 finished with value: 0.9773659865712946 and parameters: {'subsample': 0.7901787560314834, 'colsample_bytree': 0.2906561940975648, 'n_estimators': 30, 'max_depth': 6, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:34:52,701] Trial 13 finished with value: 0.9537109439818269 and parameters: {'subsample': 0.8445515756795964, 'colsample_bytree': 0.24199615663584614, 'n_estimators': 10, 'max_depth': 6, 'multiplier': 30}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:34:56,259] Trial 14 finished with value: 0.9766424308967175 and parameters: {'subsample': 0.8589941839477029, 'colsample_bytree': 0.34114557985515154, 'n_estimators': 40, 'max_depth': 8, 'multiplier': 49}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:34:57,881] Trial 15 finished with value: 0.9618996347232495 and parameters: {'subsample': 0.24948705856000786, 'colsample_bytree': 0.31650818221517607, 'n_estimators': 20, 'max_depth': 2, 'multiplier': 36}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:02,105] Trial 16 finished with value: 0.9599773041440316 and parameters: {'subsample': 0.6435489722631182, 'colsample_bytree': 0.28412720413915266, 'n_estimators': 50, 'max_depth': 8, 'multiplier': 27}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:05,295] Trial 17 finished with value: 0.9538795064202849 and parameters: {'subsample': 0.8857992208712975, 'colsample_bytree': 0.40169243820808864, 'n_estimators': 40, 'max_depth': 6, 'multiplier': 22}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:07,146] Trial 18 finished with value: 0.966799517656136 and parameters: {'subsample': 0.5186238879477051, 'colsample_bytree': 0.9836311000494069, 'n_estimators': 20, 'max_depth': 4, 'multiplier': 35}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:10,422] Trial 19 finished with value: 0.9749270232552191 and parameters: {'subsample': 0.7429938588833782, 'colsample_bytree': 0.6697068706500071, 'n_estimators': 40, 'max_depth': 6, 'multiplier': 44}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:12,652] Trial 20 finished with value: 0.9736326885053185 and parameters: {'subsample': 0.42416158099069246, 'colsample_bytree': 0.21649430838781042, 'n_estimators': 20, 'max_depth': 10, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:16,350] Trial 21 finished with value: 0.9771416274910972 and parameters: {'subsample': 0.8940305872902186, 'colsample_bytree': 0.3522908700669355, 'n_estimators': 40, 'max_depth': 8, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:19,342] Trial 22 finished with value: 0.9750814180109693 and parameters: {'subsample': 0.9125513174296409, 'colsample_bytree': 0.2827240118459643, 'n_estimators': 40, 'max_depth': 4, 'multiplier': 45}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:22,636] Trial 23 finished with value: 0.9710030850898107 and parameters: {'subsample': 0.802555039818978, 'colsample_bytree': 0.3861781381783014, 'n_estimators': 30, 'max_depth': 10, 'multiplier': 41}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:27,662] Trial 24 finished with value: 0.9754766143938752 and parameters: {'subsample': 0.9250802083481584, 'colsample_bytree': 0.5117315288154074, 'n_estimators': 60, 'max_depth': 8, 'multiplier': 47}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:31,478] Trial 25 finished with value: 0.9696032659139046 and parameters: {'subsample': 0.8027636764053213, 'colsample_bytree': 0.29202634545830225, 'n_estimators': 50, 'max_depth': 6, 'multiplier': 35}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:34,226] Trial 26 finished with value: 0.9679905777433023 and parameters: {'subsample': 0.6647422906570697, 'colsample_bytree': 0.3921207902472209, 'n_estimators': 20, 'max_depth': 14, 'multiplier': 42}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:35,521] Trial 27 finished with value: 0.9728017703787333 and parameters: {'subsample': 0.789457146764438, 'colsample_bytree': 0.6240610272580458, 'n_estimators': 10, 'max_depth': 4, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:38,515] Trial 28 finished with value: 0.9692002402974532 and parameters: {'subsample': 0.7164288286847571, 'colsample_bytree': 0.25735571412213787, 'n_estimators': 30, 'max_depth': 10, 'multiplier': 38}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:41,646] Trial 29 finished with value: 0.9502002700484943 and parameters: {'subsample': 0.926574856924657, 'colsample_bytree': 0.4729451044019096, 'n_estimators': 50, 'max_depth': 2, 'multiplier': 23}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:45,119] Trial 30 finished with value: 0.9749444144242561 and parameters: {'subsample': 0.7012794623873662, 'colsample_bytree': 0.4345799743054106, 'n_estimators': 40, 'max_depth': 8, 'multiplier': 46}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:48,606] Trial 31 finished with value: 0.9762393132334699 and parameters: {'subsample': 0.8368386666105487, 'colsample_bytree': 0.3365014100666121, 'n_estimators': 40, 'max_depth': 8, 'multiplier': 48}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:52,421] Trial 32 finished with value: 0.6191459079402236 and parameters: {'subsample': 0.8841100396220483, 'colsample_bytree': 0.34787054096011916, 'n_estimators': 50, 'max_depth': 6, 'multiplier': 1}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:56,189] Trial 33 finished with value: 0.9758798932456503 and parameters: {'subsample': 0.8716357422676627, 'colsample_bytree': 0.340820129392932, 'n_estimators': 40, 'max_depth': 10, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:35:59,827] Trial 34 finished with value: 0.96895516427647 and parameters: {'subsample': 0.996776933504578, 'colsample_bytree': 0.8849954724612084, 'n_estimators': 30, 'max_depth': 14, 'multiplier': 42}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:36:04,612] Trial 35 finished with value: 0.9753456782436641 and parameters: {'subsample': 0.7850773708752442, 'colsample_bytree': 0.4027960267685667, 'n_estimators': 60, 'max_depth': 8, 'multiplier': 47}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:36:09,694] Trial 36 finished with value: 0.9726538828412625 and parameters: {'subsample': 0.9482889262243964, 'colsample_bytree': 0.4320341042117158, 'n_estimators': 70, 'max_depth': 6, 'multiplier': 39}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:36:11,517] Trial 37 finished with value: 0.975031021457348 and parameters: {'subsample': 0.8420705556649924, 'colsample_bytree': 0.517418455570771, 'n_estimators': 20, 'max_depth': 4, 'multiplier': 47}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:36:14,382] Trial 38 finished with value: 0.9731217598182763 and parameters: {'subsample': 0.5764444728259469, 'colsample_bytree': 0.25123046027932716, 'n_estimators': 30, 'max_depth': 8, 'multiplier': 44}. Best is trial 11 with value: 0.9775087544919833.
[I 2025-11-03 20:36:18,092] Trial 39 finished with value: 0.9759757427756962 and parameters: {'subsample': 0.9640774078776057, 'colsample_bytree': 0.30967596069052383, 'n_estimators': 40, 'max_depth': 10, 'multiplier': 50}. Best is trial 11 with value: 0.9775087544919833.
Fitting best model...
🏃 View run study at: https://dss-headless-node-d7ee5be4.space-17c5f1a9-dku.svc.cluster.local:10005/dip/publicapi/#/experiments/xgboost_hp_tuning/runs/study_hRP
Let's look at the best result:
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Best trial:
Value: 0.9775087544919833
Params:
subsample: 0.8446487571076141
colsample_bytree: 0.3223747630122027
n_estimators: 30
max_depth: 6
multiplier: 50
Let's see how the HP choices impacted performance:
study_df = study.trials_dataframe()
study_df.head()
| | number | value | datetime_start | datetime_complete | duration | params_colsample_bytree | params_max_depth | params_multiplier | params_n_estimators | params_subsample | state |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.935372 | 2025-11-03 20:33:41.609822 | 2025-11-03 20:33:46.880396 | 0 days 00:00:05.270574 | 0.474419 | 12 | 18 | 50 | 0.689990 | COMPLETE |
| 1 | 1 | 0.768880 | 2025-11-03 20:33:46.880979 | 2025-11-03 20:33:53.332261 | 0 days 00:00:06.451282 | 0.963381 | 12 | 3 | 70 | 0.989413 | COMPLETE |
| 2 | 2 | 0.961126 | 2025-11-03 20:33:53.333083 | 2025-11-03 20:34:01.510973 | 0 days 00:00:08.177890 | 0.363877 | 18 | 39 | 70 | 0.499749 | COMPLETE |
| 3 | 3 | 0.974544 | 2025-11-03 20:34:01.511564 | 2025-11-03 20:34:04.241744 | 0 days 00:00:02.730180 | 0.435403 | 8 | 45 | 30 | 0.746936 | COMPLETE |
| 4 | 4 | 0.967791 | 2025-11-03 20:34:04.242438 | 2025-11-03 20:34:05.937438 | 0 days 00:00:01.695000 | 0.692353 | 12 | 40 | 10 | 0.987725 | COMPLETE |
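As a quick sanity check on this frame, trials can also be ranked by objective value directly. A minimal sketch with a toy stand-in for `study.trials_dataframe()` (column names mirror Optuna's output; the values are invented):

```python
import pandas as pd

# Toy stand-in for study.trials_dataframe(): each row is one trial.
toy = pd.DataFrame({
    "number": [0, 1, 2, 3],
    "value": [0.935, 0.769, 0.961, 0.975],
    "params_max_depth": [12, 12, 18, 8],
    "params_n_estimators": [50, 70, 70, 30],
})

# Top trials by objective value, mirroring study.best_trial.
top = toy.sort_values("value", ascending=False).head(2)
print(top[["number", "value"]])
```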
We will look at different projections of the HP space and the best observed values:
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations"
)
fig.show()
The best-performing models were found in the mid-to-higher range of boosting rounds combined with lower max depth (the latter helps avoid overfitting when the number of estimators is high).
pivot = pd.pivot_table(study_df, index="params_max_depth", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations",
)
fig.update_yaxes(
    scaleanchor="x",  # link the y axis scale to x to keep heatmap cells square
)
fig.show()
In our experiments, the high scores also corresponded with smaller column subsampling (the fraction of columns each estimator uses), unless max depth was significantly lowered. The small column sample again helps avoid overfitting, although the pattern is less clear here.
pivot = pd.pivot_table(study_df, index="params_n_estimators", columns="params_colsample_bytree", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations",
)
fig.update_yaxes(
    scaleanchor="x",  # link the y axis scale to x to keep heatmap cells square
)
fig.show()
While the patterns are less clear here, we can see that combining many boosting rounds with a high column sample leads to lower scores (the bottom-right corner, likely overfitting again).
Given that some of the best results were observed at the end of the specified search range, it would be a good next step to extend the range further, potentially with a larger step size for boosting rounds.
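A widened search space for a follow-up study could look like the sketch below. The notebook does not show the original ranges, so every bound here is an illustrative assumption, not the actual configuration:

```python
# Hypothetical widened ranges for a follow-up tuning study.
# All bounds below are illustrative assumptions.
search_space = {
    "n_estimators": range(10, 210, 20),  # extend well beyond 70, coarser step
    "max_depth": range(2, 13, 2),
    "multiplier": range(30, 101, 5),     # best trials clustered near 50
}

for name, values in search_space.items():
    print(name, min(values), max(values))
```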
Finally, we look at the multiplier effect:
pd.pivot_table(study_df, index="params_multiplier", values="value", aggfunc='mean').T
| params_multiplier | 1 | 3 | 10 | 13 | 15 | 18 | 22 | 23 | 27 | 30 | 31 | 35 | 36 | 38 | 39 | 40 | 41 | 42 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| value | 0.619146 | 0.76888 | 0.875199 | 0.92301 | 0.911875 | 0.935372 | 0.95388 | 0.9502 | 0.959977 | 0.953711 | 0.961354 | 0.968201 | 0.9619 | 0.9692 | 0.96689 | 0.967791 | 0.971003 | 0.968473 | 0.973292 | 0.974813 | 0.974944 | 0.975284 | 0.976239 | 0.976642 | 0.975262 |
On average, the higher the multiplier, the better the AUCPR score we got, which is also shown in the heatmaps below. It looks like we get the most benefit from weightings above roughly 40.
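If, as we assume here, `multiplier` up-weights the positive (high-income) class, for example via sample weights or XGBoost's `scale_pos_weight`, its effect can be sketched with plain sample weights:

```python
import numpy as np

# Toy imbalanced labels: one positive among five rows.
y = np.array([0, 0, 0, 0, 1])
multiplier = 40  # illustrative value from the range explored above

# Each positive example counts as `multiplier` negatives in the loss.
sample_weight = np.where(y == 1, multiplier, 1.0)
print(sample_weight)  # [ 1.  1.  1.  1. 40.]
```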
pivot = pd.pivot_table(study_df, index="params_multiplier", columns="params_n_estimators", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations"
)
fig.show()
This pattern is nicely visible in the heatmaps above and below as well.
pivot = pd.pivot_table(study_df, index="params_multiplier", columns="params_max_depth", values="value", aggfunc='max')
fig = px.imshow(
pivot,
color_continuous_scale="blues",
title="Best values across combinations"
)
fig.show()
We save both the predicted class and the predicted probabilities:
predictions_learn_df = pd.DataFrame(
{
TARGET: y_train,
'pred': model.predict(X_train),
'pred_proba': model.predict_proba(X_train)[:, 1]
}
)
predictions_test_df = pd.DataFrame(
{
TARGET: y_test,
'pred': model.predict(X_test),
'pred_proba': model.predict_proba(X_test)[:, 1]
}
)
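As a quick illustration of how these prediction frames can be scored downstream (toy values; the real evaluation happens in a separate recipe, and `income` is assumed to be the target column, as shown in the dataset preview):

```python
import pandas as pd

# Toy stand-in for predictions_test_df with the same three columns.
toy_preds = pd.DataFrame({
    "income": [0, 0, 1, 1],
    "pred": [0, 1, 1, 1],
    "pred_proba": [0.1, 0.6, 0.8, 0.9],
})

# Accuracy and positive-class precision straight from the saved columns.
accuracy = (toy_preds["income"] == toy_preds["pred"]).mean()
precision = (toy_preds.loc[toy_preds["pred"] == 1, "income"] == 1).mean()
print(accuracy, precision)
```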
Finally, let's look at the top 20 feature importances for our model:
fig = pd.DataFrame(
{
'importance': model.feature_importances_,
},
index=model.feature_names_in_
).sort_values('importance').tail(20).plot(kind="bar", backend='plotly')
fig.show()
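The same top-k selection can be checked without plotting. A minimal sketch with toy importances (feature names taken from the dataset, values invented):

```python
import pandas as pd

# Toy stand-in for model.feature_importances_ / model.feature_names_in_.
imp = pd.DataFrame(
    {"importance": [0.05, 0.30, 0.10, 0.55]},
    index=["age", "capital gains", "sex_Male", "weeks worked in year"],
)

# Sort ascending and take the tail, as in the plotting cell above.
top2 = imp.sort_values("importance").tail(2)
print(top2.index.tolist())  # ['capital gains', 'weeks worked in year']
```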
Observations:
All these findings align with our expectations and EDA. Our model picked up on the gender bias in our data (there are many more high-earning males than females in the dataset), which can definitely be addressed in future model iterations; please see the slides for more info.
# gender imbalance
processed_learn_df.groupby(TARGET)["sex_Male"].mean()
income
0    0.458790
1    0.785714
Name: sex_Male, dtype: float64
79% of high-income earners were male, as opposed to 46% of low-income earners. This statistical disparity is a strong signal for the model to pick up on and use for classification.
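The same per-class share can be reproduced on a toy frame mirroring the one-hot `sex_Male` column (toy values, not the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "income": [0, 0, 0, 1, 1],    # toy target values
    "sex_Male": [1, 0, 0, 1, 1],  # toy one-hot gender indicator
})

# Share of males within each income class, as in the cell above.
share = toy.groupby("income")["sex_Male"].mean()
print(share.to_dict())
```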
Finally, we save the results to their own datasets, which can then be used for evaluation:
# Write recipe outputs
predictions_learn = dataiku.Dataset("predictions_learn")
predictions_learn.write_with_schema(predictions_learn_df)
predictions_test = dataiku.Dataset("predictions_test")
predictions_test.write_with_schema(predictions_test_df)
/opt/dataiku/python/dataiku/core/schema_handling.py:68: DeprecationWarning: is_datetime64tz_dtype is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.DatetimeTZDtype)` instead.
152807 rows successfully written (k5ZHid5KGm)
78826 rows successfully written (ErGTqOEJ2x)
study_data = dataiku.Dataset("xgboost_study")
study_data.write_with_schema(study_df)
40 rows successfully written (BEhokHh8M8)